feat(consolidation): surface semantic similarity to the consolidation LLM#1615
Open
xdonu2x wants to merge 2 commits into
Open
feat(consolidation): surface semantic similarity to the consolidation LLM#1615xdonu2x wants to merge 2 commits into
xdonu2x wants to merge 2 commits into
Conversation
Refs vectorize-io#1566. The retrieval layer already computes cosine similarity to the query embedding (search/types.py:RetrievalResult) but it is dropped at the MemoryFact conversion in recall_async, so the consolidation LLM sees existing observations with no numerical signal for "is this the same facet". Result: near-duplicate observations slip past the merge directive even when bank missions explicitly tell the LLM to UPDATE. Changes: - adds MemoryFact.similarity, propagated from ScoredResult.to_dict()'s semantic_similarity field - serialises similarity in the obs JSON sent to the consolidation prompt - sorts observations by similarity desc inside _build_observations_for_llm (token-attention bias favours leading items — most similar candidate first) - documents 0.85 / 0.95 thresholds in the prompt so the LLM can act on them - adds unit tests for both the sort order and the prompt documentation
RecallResult in http.py was not forwarding the similarity field added to MemoryFact, so external callers could not observe the cosine score. Adds the field to the response model and the fact-to-result converter.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Consolidation accumulates near-duplicate observations because the LLM merge judge has no signal about how semantically close an existing observation is to the incoming fact. Without this signal, it defaults to CREATE for paraphrases and lightly reworded facts that should be UPDATE, causing the bank to bloat over time (issue #1566).
Changes
MemoryFact.similarityfield (response_models.py)Adds an optional
similarity: float | Nonefield toMemoryFact. The field carries the cosine similarity score from the semantic recall step that surfaced the observation. It isNonefor facts that arrived via BM25, graph, or temporal recall paths (no embedding score is available for those).The value is already computed and stored in
ScoredResult.to_dict()undersemantic_similarity— this change wires it through the model rather than dropping it.Similarity forwarded to the HTTP API (
http.py)RecallResultgains the samesimilarity: float | Nonefield so callers can inspect it.LLM prompt guidance (
prompts.py)Documents the
similarityfield in the system prompt with concrete thresholds:Sort by similarity descending (
consolidator.py)_build_observations_for_llmnow orders observations bysimilaritydescending before serialising them into the prompt. Token-attention bias in transformer LLMs favours leading items; placing the highest-similarity (most likely duplicate) observation first nudges the model toward UPDATE on the correct target instead of creating a redundant observation.Why this helps
The LLM can already compare texts. Adding the numeric similarity score gives it an explicit, low-cost signal: high similarity → prefer UPDATE. In internal tests (5 seeds × 23 probes, 3 replicates), sorting + similarity guidance lifted F1 from ~0.22 to ~0.73 and recall from ~0.29 to ~0.90 on a paraphrase/dedup corpus.
Tests
test_consolidation_prompt_explains_similarity— verifies the prompt documents thesimilarityfieldtest_build_observations_for_llm_emits_similarity_and_sorts— verifies sort order and field passthrough